library(tm)
library(tidytext)
library(tidyverse)
library(DT)
library(wordcloud2)
library(word2vec)
library(ggplot2)
library(plotly)
library(uwot)
In our daily life, there are moments of happiness that make us feel fulfilled and energetic. In this project, we explore these moments from several angles using Natural Language Processing (NLP) and produce a data story about what makes people happy.
The data originate from HappyDB, a database that records a corpus of 100,000 crowd-sourced happy moments. Before the analysis, the raw text must be processed into a form better suited for analysis. Thankfully, the processed data set had already been generated as processed_moments.csv, so we use that file directly.
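The exact pipeline behind the processed file is not shown here, but a minimal sketch of this kind of cleaning with tidytext (tokenization, lowercasing, stop-word removal; the stemming step that produced forms like "connected" is omitted) might look like the following. The toy sentence is taken from row 3 of the data.

```r
library(tidytext)
library(dplyr)

# One toy happy moment, cleaned the way the processed "text" column suggests:
# tokenize (which also lowercases and strips punctuation), then drop
# common English stop words.
moments <- data.frame(id = 1,
                      cleaned_hm = "I went to the gym this morning and did yoga.",
                      stringsAsFactors = FALSE)

tokens <- moments %>%
  unnest_tokens(word, cleaned_hm) %>%   # one row per word, lowercased
  anti_join(stop_words, by = "word")    # remove stop words ("i", "the", ...)

tokens$word
```

Applied to every row of cleaned_hm, this yields a bag-of-words column like text in the processed file.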
First, let us briefly look at the data set's structure and the variables it contains.
dat <- read.csv('../output/processed_moments.csv')
head(dat,5)
## hmid wid reflection_period
## 1 27673 2053 24h
## 2 27674 2 24h
## 3 27675 1936 24h
## 4 27676 206 24h
## 5 27677 6227 24h
## original_hm
## 1 I went on a successful date with someone I felt sympathy and connection with.
## 2 I was happy when my son got 90% marks in his examination
## 3 I went to the gym this morning and did yoga.
## 4 We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.
## 5 I went with grandchildren to butterfly display at Crohn Conservatory\n
## cleaned_hm
## 1 I went on a successful date with someone I felt sympathy and connection with.
## 2 I was happy when my son got 90% marks in his examination
## 3 I went to the gym this morning and did yoga.
## 4 We had a serious talk with some friends of ours who have been flaky lately. They understood and we had a good evening hanging out.
## 5 I went with grandchildren to butterfly display at Crohn Conservatory\n
## modified num_sentence ground_truth_category predicted_category id
## 1 TRUE 1 <NA> affection 1
## 2 TRUE 1 <NA> affection 2
## 3 TRUE 1 <NA> exercise 3
## 4 TRUE 2 bonding bonding 4
## 5 TRUE 1 <NA> affection 5
## text
## 1 connected date successfully sympathy
## 2 examination marks son
## 3 gym morning yoga
## 4 evening flaky friend hang talked understood
## 5 butterfly conservatory crohn display grandchildren
colnames(dat)
## [1] "hmid" "wid" "reflection_period"
## [4] "original_hm" "cleaned_hm" "modified"
## [7] "num_sentence" "ground_truth_category" "predicted_category"
## [10] "id" "text"
We find that the variable predicted_category contains the predicted category for each happy moment. Below, we carry out some analysis based on this variable.
table(dat$predicted_category)
##
## achievement affection bonding enjoy_the_moment
## 33897 34164 10726 11109
## exercise leisure nature
## 1196 7457 1843
The table shows that achievement and affection account for the largest fractions among the categories of recorded moments. We therefore believe that these two kinds of feelings bring people the most happiness.
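Since ggplot2 is already loaded, the same distribution can also be shown as a sorted bar chart, which makes the imbalance between categories easier to compare than raw counts. This is a sketch using the dat object read above.

```r
# Turn the frequency table into a data frame and plot it, sorted by count.
category_counts <- as.data.frame(table(dat$predicted_category))
colnames(category_counts) <- c("category", "count")

ggplot(category_counts, aes(x = reorder(category, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Predicted category", y = "Number of happy moments")
```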
So far we have taken a brief glance at the data. Next, we want to explore the sentiment of the recorded sentences and extract the feelings that make people happy. We start by counting the occurrences of each word.
dat_word <- dat %>%
  unnest_tokens(word, text)
dat_word_count <- dat_word %>%
  count(word, sort = TRUE)
head(dat_word_count,5)
## word n
## 1 friend 10892
## 2 day 9930
## 3 time 9692
## 4 family 4692
## 5 watched 4385
After obtaining the word counts, we use them to generate our first word cloud.
wordcloud2(dat_word_count[which(dat_word_count$n>300),])
Above is our first word cloud, in which frequent words such as friend, day, and home stand out. Although we can glean a lot of information from this graph, it has a clear disadvantage: it contains too many words, which makes it look overwhelming. To reduce the clutter, we keep only the words that appear more than 2,000 times in the corpus.
dat_word_count_l2k <- dat_word_count[which(dat_word_count$n>2000),]
wordcloud2(dat_word_count_l2k)
By restricting to the most frequent words, we obtain a more readable word cloud containing only the most important words.
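As a complement to the word cloud, a simple bar chart gives the exact counts at a glance. A sketch using the dat_word_count table built above:

```r
# Plot the 10 most frequent words with their exact counts, largest on top.
top_words <- head(dat_word_count, 10)

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Word count")
```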
By now, we have processed the data, glanced over its structure, and generated basic word clouds showing the most frequent words. Now we are interested in the connections between words: are there similarities among them, and how could we find a person's probable interests given some known triggers of his or her happiness? Here, we use the word2vec package to carry out this analysis.
dat_vec <- dat
set.seed(4243)
model1 <- word2vec(x = dat_vec$text, dim = 15, iter = 20)
Now we have model1 trained for word embedding. Then we can type in words we are interested in and find their nearest neighbors. For instance, I am quite interested in games and basketball, and I wonder what similar things could make me happy as well.
embed <- predict(model1, c("game", "basketball"), type = "nearest", top_n = 10)
embed
## $game
## term1 term2 similarity rank
## 1 game franchise 0.9045113 1
## 2 game league 0.9023066 2
## 3 game hockey 0.9016669 3
## 4 game football 0.8996831 4
## 5 game fifa 0.8943493 5
## 6 game gt 0.8917541 6
## 7 game baseball 0.8864343 7
## 8 game hearthstone 0.8861898 8
## 9 game ii 0.8798506 9
## 10 game gamer 0.8745061 10
##
## $basketball
## term1 term2 similarity rank
## 1 basketball baseball 0.9707909 1
## 2 basketball cubs 0.9377135 2
## 3 basketball beat 0.9207459 3
## 4 basketball hockey 0.9204223 4
## 5 basketball defeated 0.9136161 5
## 6 basketball championship 0.9119426 6
## 7 basketball ping 0.9061303 7
## 8 basketball guild 0.9041461 8
## 9 basketball champions 0.8999110 9
## 10 basketball football 0.8979672 10
The above lists 20 other things that could probably make me happy, given that I am a fan of games and basketball. I might try some of them myself.
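The similarity column reported by predict() is the cosine similarity between embedding vectors. As a sanity check (a sketch, assuming model1 from above), we can recompute one of the values by hand:

```r
# Extract the raw embedding vectors for two related words.
emb <- predict(model1, c("basketball", "baseball"), type = "embedding")

# Cosine similarity: dot product divided by the product of the norms.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_sim(emb["basketball", ], emb["baseball", ])
```

The result should be close to the similarity reported in the nearest-neighbor table above.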
Then, we can draw an interactive plot to help us map the words we are interested in.
model2 <- word2vec(x = dat_vec$text, dim = 15, iter = 20)
embed_plot <- as.matrix(model2)
viz <- umap(embed_plot, n_neighbors = 15, n_threads = 2)
df_tmp <- data.frame(word = rownames(viz),
                     x = viz[, 1], y = viz[, 2],
                     stringsAsFactors = FALSE)
plot_ly(df_tmp, x = ~x, y = ~y, type = "scatter",
        mode = "text", text = ~word)
This graph looks quite messy, but we can hover the pointer over any point to see the word it represents.
We all know the famous analogy: \[king - man + woman = queen\] Similarly, we want to know the most probable triggers of happiness for a person who values friends and likes to play games but does not like basketball. All we need to do is compute the corresponding position in the embedding space and find the nearest words around it.
tmp <- predict(model2, newdata = c("friend", "game", "basketball"),
               type = "embedding")
tmp_form <- tmp["friend", ] + tmp["game", ] - tmp["basketball", ]
predict(model2, newdata = tmp_form,
        type = "nearest", top_n = 10)
## term similarity rank
## 1 iave 0.9998181 1
## 2 hey 0.9986129 2
## 3 internet 0.9983231 3
## 4 parents 0.9981050 4
## 5 mine 0.9974617 5
## 6 iaam 0.9971441 6
## 7 human 0.9965062 7
## 8 friendship 0.9963590 8
## 9 hard 0.9950250 9
## 10 imagine 0.9940613 10
Above we have conducted several analyses based on HappyDB. However, there is still more work that could be done. For instance, we could reduce the model's complexity by limiting the minimum occurrence of words, and we could train a neural network on the embeddings to determine to whom a given set of happiness triggers most likely belongs.
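As a first step in that direction, here is a sketch of the vocabulary-pruning idea: word2vec() accepts a min_count argument (words occurring fewer than min_count times are discarded from the vocabulary), which shrinks the embedding matrix and hence any downstream model. The threshold of 50 below is an arbitrary illustration.

```r
# Retrain the embedding while discarding rare words; min_count = 50 keeps
# only words that appear at least 50 times in the corpus.
set.seed(4243)
model3 <- word2vec(x = dat_vec$text, dim = 15, iter = 20, min_count = 50)

# The embedding matrix now has far fewer rows (one per retained word).
nrow(as.matrix(model3))
```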